Internet Message Format | 1994-01-24 | 4KB

Return-Path: <fine@cis.ohio-state.edu>
Received: from dxmint.cern.ch by nxoc01.cern.ch (NeXT-1.0 (From Sendmail 5.52)/NeXT-2.0)
	id AA02265; Fri, 8 Jan 93 21:23:57 MET
Received: by dxmint.cern.ch (5.65/DEC-Ultrix/4.3)
	id AA03129; Fri, 8 Jan 1993 21:38:55 +0100
Received: by soccer.cis.ohio-state.edu (5.61-kk/5.911008)
	id AA14229; Fri, 8 Jan 93 15:38:20 -0500
Date: Fri, 8 Jan 93 15:38:20 -0500
From: Thomas A. Fine <fine@cis.ohio-state.edu>
Message-Id: <9301082038.AA14229@soccer.cis.ohio-state.edu>
To: connolly@pixel.convex.com, @cis.ohio-state.edu@cis.ohio-state.edu
Subject: Re: dealing with new-lines
Cc: www-talk@nxoc01.cern.ch
X-Mailer: Perl Mail System v1.1

>Darn good question. Your approach appears to have the correct
>results, but I'm not sure it's practical for many implementations
>(global search-and-replace operations are inconvenient for
>sequential processing models), and it certainly isn't a healthy
>way to think about SGML documents.

But most browsers seem to have caching anyway, which means they
can do global search/replace. But you can still do it more or less
sequentially. Just buffer strings of new-lines until you know what
follows them, and then deal with it. There's no method you can
propose which is correct and doesn't involve storing something
somewhere.

>The way to think about SGML documents, IMHO, is this: the sequence
>of characters comprising an SGML document are presented to an
>SGML parser, which parses the markup from the data and passes
>the "results" to the processing application.

This is another alternative I considered. But I figured that I have
to deal with various parsing things when I read the HTML anyway.
I was just going to take each chunk of data (with anchors
pre-processed out) and remove all whitespace at the beginning and
end (except for PRE sections and such). But if someone put in
whitespace, why should I muck with it? Who knows, they might have
even wanted it there.

>>1. For each tag NOT in
>>     <PRE> </PRE> <A> </A> <PLAINTEXT>
>>   remove ALL surrounding new-lines.
>
>First, let's get one thing straight: the PLAINTEXT element as
>described by the original HTML documentation is not representable
>in SGML. For my purposes, I consider the HTML document to
>end at the <PLAINTEXT> tag, and I consider the rest of the
>data stream to be an RFC-822 message body or a MIME text/plain body,
>and not SGML at all.

I hadn't meant otherwise. But you have to read it in anyway, and
since my method deals with things prior to any other parsing, you
treat it all as one clump.

>Next, let's keep in mind that you can't do things like the following
>global substitution,
>s/\n+(<(H1|H2|ADDRESS...))>/$2/g;
>because it might find things that look like tags but aren't,
>for example
>
><foo bar="
><H1>this is a little kooky, but nonetheless legal and possible.">
>
>But even if you're using a proper SGML parser, consider:
>
><H1>Here we go!
><a href="#xyz">click here</a>
>There we went!
></H1>
>
>The parser will return an H1 start tag, and then the
>string "Here we go!\n". At this point, your rule doesn't
>tell me what to do with the newline. I have to get
>the next object before I decide.

Like I said before, you have to do some sort of storage at some
point anyway.

>Hmm... I guess that's reasonable. But I'd rather just pass all the

Like I said before, you have to do some sort of storage at some
point anyway.

>My point is: don't use whitespace to represent significant
>information except in the PRE element. Use the tags that
>are defined to have significance.

I suppose I agree with this more or less, at least from the point
of view of generating my own code. But we have to make something
clear - can a browser keep all the whitespace if it wants to?
Or in other words, can an HTML generator assume collapsing whitespace,
or just be aware that it might happen?

     tom
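
The "buffer strings of new-lines until you know what follows them"
idea from the message can be sketched roughly as below. This is only
an illustration in Python, not code from the thread. It covers just
the quoted rule 1 (drop new-lines around any tag other than PRE, A
and PLAINTEXT). The names tokenize, strips_newlines, filter_newlines
and strip_newlines_around_tags are made up for the example, and the
regex tokenizer is a naive stand-in for a real SGML parser, so it is
still fooled by the tag-inside-an-attribute case quoted above.

```python
import re
from typing import Iterable, Iterator

# Tags whose surrounding new-lines are left alone (from the quoted rule 1).
KEEP_NEWLINES = {"PRE", "A", "PLAINTEXT"}

# Naive tokenizer (an assumption, not a real SGML parser): a token is a
# tag-like "<...>", a run of new-lines, a run of other text, or a stray "<".
TOKEN = re.compile(r"<[^>]*>|\n+|[^<\n]+|<")
TAG_NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")


def tokenize(text: str) -> Iterator[str]:
    """Yield tokens one at a time, in document order."""
    for match in TOKEN.finditer(text):
        yield match.group(0)


def strips_newlines(token: str) -> bool:
    """True if the token is a tag that should lose its surrounding new-lines."""
    if not (token.startswith("<") and token.endswith(">")):
        return False
    match = TAG_NAME.match(token)
    return match is not None and match.group(1).upper() not in KEEP_NEWLINES


def filter_newlines(tokens: Iterable[str]) -> Iterator[str]:
    """Sequentially drop new-line runs that touch a stripping tag.

    The only state kept is one pending run of new-lines and one flag,
    which is the "storing something somewhere" the message refers to.
    """
    pending = ""                 # new-lines seen but not yet decided on
    after_stripping_tag = False  # was the last emitted token a stripping tag?
    for token in tokens:
        if token.startswith("\n"):
            if not after_stripping_tag:
                pending += token  # hold until we know what follows
            continue              # otherwise drop: it follows a stripping tag
        if strips_newlines(token):
            pending = ""          # drop new-lines that precede the tag
            after_stripping_tag = True
        else:
            after_stripping_tag = False
        if pending:
            yield pending         # new-lines before text or a kept tag survive
            pending = ""
        yield token
    if pending:
        yield pending             # trailing new-lines after text are kept


def strip_newlines_around_tags(text: str) -> str:
    return "".join(filter_newlines(tokenize(text)))
```

Run on the H1 example quoted in the message, this sketch keeps the
new-lines on either side of the anchor and drops the ones next to
</H1>, deciding each case only once the following token has arrived.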